European Soccer Strikers’ Performance (EDA)

by Erdem Koç

Data used in this EDA is from here

Univariate Plots Section

A first look at the data shows that we have 13 columns and 7062 observations. Some of these columns are numeric and some are categorical, which makes the dataset a good example for this project. Let’s start exploring European football’s strikers!

Some summary information

## 'data.frame':    7062 obs. of  13 variables:
##  $ age           : int  26 25 22 26 26 26 23 27 21 30 ...
##  $ current.club  : Factor w/ 193 levels "AC Milan","ACF Fiorentina",..: 193 193 193 193 193 161 161 161 160 161 ...
##  $ current.league: Factor w/ 10 levels "Eredivisie","Jupiler Pro League",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ foot          : Factor w/ 4 levels "","both","left",..: 4 2 2 4 4 2 4 3 3 4 ...
##  $ height        : int  183 180 179 183 188 174 175 178 180 183 ...
##  $ name          : Factor w/ 1173 levels "Éder","Édouard Duplan",..: 273 318 1013 49 114 919 899 674 1170 690 ...
##  $ nationality   : Factor w/ 285 levels "Albania","Albania  Bulgaria",..: 225 11 11 225 225 196 39 212 225 39 ...
##  $ position      : Factor w/ 2 levels "CF","W": 2 2 1 1 1 2 2 2 2 1 ...
##  $ season        : Factor w/ 6 levels "2012-13","2013-14",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ assists       : int  1 NA NA 5 0 11 NA 5 NA 9 ...
##  $ games         : int  22 NA NA 26 5 42 NA 40 NA 32 ...
##  $ goals         : int  3 NA NA 13 1 17 NA 2 NA 15 ...
##  $ minutes       : int  1057 NA NA 2287 171 3678 NA 3588 NA 2455 ...
##       age                  current.club             current.league
##  Min.   :16.00   Sparta Rotterdam:  66   LaLiga            : 828  
##  1st Qu.:22.00   AS Monaco       :  60   Eredivisie        : 792  
##  Median :25.00   Royal Antwerp FC:  60   Ligue 1           : 780  
##  Mean   :25.44   Akhmat Grozny   :  54   Premier League    : 756  
##  3rd Qu.:28.00   AOK Kerkyra     :  54   Jupiler Pro League: 690  
##  Max.   :38.00   Asteras Tripolis:  54   Süper Lig         : 690  
##                  (Other)         :6714   (Other)           :2526  
##     foot          height                     name           nationality  
##       : 318   Min.   :163.0   Wanderson        :  18   Spain      : 504  
##  both : 714   1st Qu.:176.0   Leandrinho       :  12   Brazil     : 372  
##  left :1350   Median :180.0   William          :  12   Italy      : 294  
##  right:4680   Mean   :180.6   Éder             :   6   Russia     : 270  
##               3rd Qu.:185.0   Édouard Duplan   :   6   Greece     : 264  
##               Max.   :204.0   Ömer Ali Sahiner :   6   Netherlands: 228  
##               NA's   :108     (Other)          :7002   (Other)    :5130  
##  position      season        assists           games      
##  CF:3420   2012-13:1177   Min.   : 0.000   Min.   : 0.00  
##  W :3642   2013-14:1177   1st Qu.: 1.000   1st Qu.:18.00  
##            2014-15:1177   Median : 2.000   Median :27.00  
##            2015-16:1177   Mean   : 3.367   Mean   :26.18  
##            2016-17:1177   3rd Qu.: 5.000   3rd Qu.:35.00  
##            2017-18:1177   Max.   :31.000   Max.   :66.00  
##                           NA's   :949      NA's   :949    
##      goals           minutes    
##  Min.   : 0.000   Min.   :   0  
##  1st Qu.: 2.000   1st Qu.: 878  
##  Median : 5.000   Median :1662  
##  Mean   : 6.671   Mean   :1694  
##  3rd Qu.: 9.000   3rd Qu.:2428  
##  Max.   :61.000   Max.   :5060  
##  NA's   :949      NA's   :949

Distribution of nr. of goals scored by players

Since the goal is the ultimate “goal” of the game :), let’s take a look at the histogram of goals scored.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   5.000   6.671   9.000  61.000     949

There are two interesting points here:

  1. Many player-seasons end without a single goal, hence the peak at 0 goals.

  2. The distribution of goals scored is right-skewed: the mean (6.67) lies above the median (5), and the maximum is 61.

Distribution of nr. of assists made by players

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   2.000   3.367   5.000  31.000     949

The distribution of assists is similar to that of goals, and the reasoning behind it is more or less the same. However, let’s keep in mind that the number of assists is generally about half the number of goals (mean 3.37 vs. 6.67). This is an expected result, as assists mostly come from midfielders, not strikers.

Distribution of minutes played

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     878    1662    1694    2428    5060     949

Note that there is a peak close to 0 minutes. This may be due to several factors: substitutions, injuries, imperfections in data collection, etc.

Distribution of nr. of games played by players

The distribution of games played is similar to that of minutes, but the peak is a bit further to the right. This follows from the obvious correlation between minutes and games played, and from the fact that a game is counted whether a player plays 90 minutes or 1 minute.

Structure of data from seasons

Obviously there is more data after the 2013-14 season. This suggests that we should look at relative frequencies of games played rather than raw counts, so that we can normalise away the effect of having more data as the seasons approach the present day.

Review note: the total number of samples is not equal across seasons. To be able to compare the histograms, I divide each season’s counts by that season’s total number of samples, so that the area under each curve is 1 and the curves are directly comparable.
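The normalisation described above can be sketched in base R as follows. This is a minimal illustration, not the original plotting code: the toy data frame `games`, its values, and the bin edges are assumptions. For a single histogram, `hist(x, freq = FALSE)` gives a comparable density scale.

```r
# Toy data with an unequal number of observations per season (made-up values).
games <- data.frame(
  season = c(rep("2012-13", 4), rep("2017-18", 8)),
  games  = c(0, 0, 10, 30, 0, 10, 10, 20, 20, 30, 30, 30)
)

# Bin the games, count per season, then divide each season's counts by its total.
games$bin <- cut(games$games, breaks = c(-1, 9, 19, 29, 39),
                 labels = c("0-9", "10-19", "20-29", "30-39"))
counts <- table(games$season, games$bin)
shares <- prop.table(counts, margin = 1)  # each row (season) now sums to 1

print(round(shares, 2))
```

With equal-width bins, these per-season shares differ from densities only by a constant factor, so the shapes of the curves can be compared across seasons.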

Note that for seasons 2012-13 and 2013-14 the share of zero minutes and zero games is relatively high. We must keep this in mind while investigating a feature over seasons.

Age and preferred foot

Below I provide some more univariate information to keep in mind for the following parts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   22.00   25.00   25.44   28.00   38.00

The age histogram of players is as expected: they can turn professional at the age of 16 and tend to retire some time after they are 30 years old.

Roughly 70% of the players in the data are right-footed, 20% are left-footed, and 10% can use both feet.
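These shares can be checked against the counts in the summary output above. The sketch below is one way to do it, assuming the 318 rows with an empty `foot` level are left out of the denominator; the label `unknown` is a name chosen here for that empty level.

```r
# Counts copied from the summary output; "unknown" stands for the empty level.
foot_counts <- c(unknown = 318, both = 714, left = 1350, right = 4680)

# Shares among players with a recorded foot, in percent.
known  <- foot_counts[names(foot_counts) != "unknown"]
shares <- round(100 * known / sum(known), 1)

print(shares)
```

Including the unknown rows in the denominator would lower each share slightly (right-footed drops to about 66%).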

Computed variables

I believe the following variables are interesting to look at:

  1. Goals/minute
  2. Goals/game
  3. Minutes/Game

Let’s add them to the dataset and plot histograms.
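A minimal sketch of how these three columns could be computed in base R. The toy data frame `strikers` and its values are made up; `goalsperminute` and `goalspergame` are the column names used later in this report, while `minutespergame` is an assumed name.

```r
# Toy data standing in for the real dataset (values are made up).
strikers <- data.frame(
  goals   = c(3, 13, 0, 0),
  games   = c(22, 26, 5, 0),
  minutes = c(1057, 2287, 171, 0)
)

# Ratio columns; 0 / 0 yields NaN in R, which is.na() also reports as missing.
strikers$goalsperminute <- strikers$goals / strikers$minutes
strikers$goalspergame   <- strikers$goals / strikers$games
strikers$minutespergame <- strikers$minutes / strikers$games

# Keep only rows where at least one goal was scored before plotting a histogram.
scored <- subset(strikers, !is.na(goalspergame) & goalspergame > 0)

print(scored)
```

The last row shows why the number of missing values grows in the computed columns: a player with 0 games and 0 minutes produces NaN in all three ratios.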

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0017  0.0032  0.0036  0.0050  0.0417    1018

Note that we filter out the goalsperminute = 0 cases, as this histogram is more relevant “if” a goal is scored, and a peak at 0 makes it hard to see the actual characteristics when one or more goals are scored.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0968  0.1905  0.2279  0.3243  2.0000    1016

Note again that we filter out the no-goal cases. It is also important to note that max(goalspergame) = 2.0 does not mean that a player cannot score more than 2 goals in a single game. It is the highest season-long ratio of goals scored to games played that any player reached.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   49.40   63.24   60.93   74.93  120.00    1016

This histogram is again as expected. A football game lasts 90 minutes, yet the 120-minute maximum is not necessarily an error: some games go to extra time. We are looking at leagues, not tournaments, so this should not normally occur, but I will keep the value as it is, because there may be a play-off or a similar game that lasts 120 minutes. Since it is very rare, it will not be critical to our conclusions.

Univariate Analysis

What is the structure of your dataset?

Data includes 7062 observations with 13 variables. 7 of these are factor variables. I added 3 computed variables from other columns.

Factor variables:

  • current.club : 193 levels
  • current.league: 10 levels
  • foot : 4 levels “”, “both”, “left”, “right” (the empty level covers the 318 rows with no recorded foot)
  • name : 1173 levels
  • nationality : 285 levels
  • position : 2 levels “CF”,“W”
  • season : 6 levels “2012-13”,“2013-14”…“2017-18”

Other observations:

  • There was no need for ordered factors except for seasons (alphabetical ordering was sufficient for all practical purposes).

  • The 2012-13 and 2013-14 seasons include somewhat more players with 0 minutes and 0 games. This may be due to the data collection method and/or because players gain experience over time and become more regular starters, hence get more minutes and play in more games.

What is/are the main feature(s) of interest in your dataset?

The number of goals a player scores.

I would like to investigate whether more playing time leads to more goals, whether there is a correlation between the assists and goals a player makes, and how the goal performance of players is affected by other possible factors such as league and position.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I expect goals to be related to the time a player spends on the pitch, and maybe to his position and the league he plays in. It is also interesting to see the effect of age: it is possible that more experienced players reach higher goal rates.

Did you create any new variables from existing variables in the dataset?

I created the following variables and added them to the dataframe:

  1. Goals/minute
  2. Goals/game
  3. Minutes/Game

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The original data was wide, as the goals/assists/minutes/games were given as separate columns for each season. I did a union-like operation (as in SQL) to append these and added a season column to make it easier to plot histograms across all seasons.
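The wide-to-long step can be sketched in base R as below. This is an illustration only: the reshaping for this report was done beforehand with pandas, and the wide column names (`goals_2012_13`, `games_2012_13`, and so on) and the toy values are assumptions about the original layout.

```r
# Hypothetical wide layout: one row per player, one goals/games column per season.
wide <- data.frame(
  name          = c("Player A", "Player B"),
  goals_2012_13 = c(3, 13),
  games_2012_13 = c(22, 26),
  goals_2013_14 = c(5, NA),
  games_2013_14 = c(30, NA),
  stringsAsFactors = FALSE
)

seasons <- c("2012-13", "2013-14")

# Build one block per season, then stack the blocks (similar to SQL UNION ALL).
blocks <- lapply(seasons, function(s) {
  suffix <- gsub("-", "_", s)
  data.frame(
    name   = wide$name,
    season = s,
    goals  = wide[[paste0("goals_", suffix)]],
    games  = wide[[paste0("games_", suffix)]],
    stringsAsFactors = FALSE
  )
})
long <- do.call(rbind, blocks)

print(long)
```

Each player then contributes one row per season, which matches the equal per-season row counts (1177) seen in the summary output.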

It was also not logical to look at goals/minute and goals/game when no goal was scored. This led to very high peaks at 0 on the histograms and made it hard to observe the distribution characteristics of these variables. So, while plotting the histograms of the calculated variables, I filtered out the cases with no goals.

I also observed that the number of NaNs in the calculated columns increased, as it is possible to have 0 minutes/games, and dividing by these leads to NaN. My histogram code already filters these with “!is.na(my_variable)”, so no further action was needed.

Bivariate Plots Section

Below is a scatter matrix produced with the ggpairs() function to give us a general look.

Review note: I agree that the values are mixed here, but a correlation plot alone would not be sufficient, because the scatter matrix also let me see box plots of categorical variables like position. I am using this plot as a very rough starting point.

## `geom_smooth()` using method = 'gam'

As expected, as the time a player spends on the pitch increases, the total number of goals he scores increases.

Note also that the relationship looks closer to exponential, especially beyond 3250 minutes on the pitch (blue). Although the variance increases in this region, this non-linear increase is not a coincidence: more skilled players play almost every game. That is, they get more game time and they are better at scoring. These are the effects that create the exponential-looking relationship as we go beyond 3000 minutes.

This plot is also interesting. Note that there is an almost horizontal line up to 10 games. So it can be observed that if a player plays more than 10 games in a season, his goal tally grows (very roughly linearly) with the number of games he plays. This may be because players who play fewer than 10 games per season are usually substitutes and/or less talented players.

Here we observe that as players get more experienced, total goals per season increase, but the data becomes noisier for ages below 20 and above 30 because there are fewer samples, i.e. fewer footballers of these ages.

This suggests that there is not much of a relationship between the age of a player and the minutes he plays per season. I would have expected a little more correlation here, so it was interesting to see that there is not.

This suggests that central forwards (CF) on average score more goals per season than wingers (W). This is again an expected result. What about the assists they make?

And here it is. Wingers tend to make more assists compared to central forwards. This is again expected, as they play on the sides of the pitch and usually cross in appropriate situations so that CFs can score. CFs, however, tend to go directly for goal, and unless the team is playing with 2 forwards their assist options will be fewer.

Now at first sight… there is not much to see here. But if we focus on the outliers, LaLiga has the highest ones. If you are a football fan, you may have already started to think about Lionel Messi and Cristiano Ronaldo. Let’s see who these LaLiga guys are.

##                   name goals  season
## 2452 Cristiano Ronaldo    61 2014-15
## 94        Lionel Messi    60 2012-13
## 3627       Luis Suárez    59 2015-16
## 2448      Lionel Messi    58 2014-15
## 98   Cristiano Ronaldo    55 2012-13
## 4802      Lionel Messi    54 2016-17
## 1275 Cristiano Ronaldo    51 2013-14
## 3629 Cristiano Ronaldo    51 2015-16

And you were right! The only exception is Luis Suárez, who scored 59 goals in the 2015-16 season.
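A listing like the one above can be produced by filtering and sorting. The sketch below assumes a data frame with the dataset's column names (`name`, `current.league`, `goals`, `season`); the rows themselves are hypothetical.

```r
# Hypothetical rows; only the column names follow the dataset shown earlier.
strikers <- data.frame(
  name           = c("Player A", "Player B", "Player C", "Player D"),
  current.league = c("LaLiga", "LaLiga", "Serie A", "LaLiga"),
  goals          = c(40, 55, 30, NA),
  season         = c("2014-15", "2012-13", "2015-16", "2013-14"),
  stringsAsFactors = FALSE
)

# Keep LaLiga rows with a recorded goal count, then sort by goals, highest first.
laliga <- subset(strikers, current.league == "LaLiga" & !is.na(goals))
top    <- head(laliga[order(-laliga$goals), c("name", "goals", "season")], 8)

print(top)
```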

Most efficient Leagues, Players and Clubs

It is also interesting to look at the efficiency of goal scoring. We define efficiency as the mean value of goals per minute by league/club/player, i.e. how many goals per minute a player/club/league scores.

It is important to point out that I used only the 2017-18 season for these computations. This is because the data has a single “current club” column, while some players change teams between seasons, so earlier seasons cannot be reliably attributed to a club or league. The structure of the data is not appropriate for grouping over seasons in such computations.

Below are some interesting stats.

## # A tibble: 10 x 3
##    current.league      Mean     n
##    <fct>              <dbl> <int>
##  1 Süper Lig          0.366   115
##  2 Eredivisie         0.323   132
##  3 Serie A            0.319   112
##  4 Premier League     0.311   126
##  5 Jupiler Pro League 0.303   115
##  6 Ligue 1            0.301   130
##  7 LaLiga             0.300   138
##  8 Liga NOS           0.263   111
##  9 Super League       0.233   106
## 10 Premier Liga       0.218    92

The most surprising thing here was to see that the Turkish Süper Lig is where the most efficient strikers play! This season, a striker there scores on average 0.37 goals for every 90 minutes he plays. (Note: I multiplied goalsperminute by 90 so that we can talk about goals per 90-minute game, which is more intuitive. This should not be confused with our actual “goalspergame” column.)
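The league table can be reproduced along these lines. This is a sketch under assumptions: the data frame `season_1718`, the league labels, and the output column name `goals_per_90` are made up for illustration, and the original analysis may have used a different grouping tool.

```r
# Toy 2017-18 data (made-up values); the last row mimics a player with no record.
season_1718 <- data.frame(
  current.league = c("League X", "League X", "League Y", "League Y", "League Y"),
  goals          = c(10, 0, 5, 2, NA),
  minutes        = c(1800, 900, 900, 1800, NA),
  stringsAsFactors = FALSE
)

season_1718$goalsperminute <- season_1718$goals / season_1718$minutes
valid <- subset(season_1718, !is.na(goalsperminute))

# Mean of the per-player rates in each league, scaled to 90 minutes.
eff <- aggregate(goalsperminute ~ current.league, data = valid,
                 FUN = function(x) mean(x) * 90)
names(eff)[2] <- "goals_per_90"

# Number of players behind each mean, then sort from most to least efficient.
eff$n <- as.vector(table(valid$current.league)[as.character(eff$current.league)])
eff   <- eff[order(-eff$goals_per_90), ]

print(eff)
```

Note that this averages per-player rates, so a player with few minutes weighs as much as a regular starter. Dividing a league's total goals by its total minutes would give a different, minutes-weighted figure.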

## # A tibble: 6 x 4
## # Groups:   name [6]
##   name              current.club   goals_per_minute total_minutes
##   <fct>             <fct>                     <dbl>         <int>
## 1 Bafétimbi Gomis   Galatasaray SK             1.06          2369
## 2 Jonas             SL Benfica                 1.06          2881
## 3 Ciro Immobile     SS Lazio                   1.06          2882
## 4 Burak Yilmaz      Trabzonspor                1.05          1536
## 5 Cristiano Ronaldo Real Madrid                1.03          2877
## 6 Rangelo Janga     KAA Gent                   1.01          1787

Here it is interesting to see the top 3. (Note: as in the league table, the values are goals per 90 minutes, despite the column name. I kept only players who played at least 900 minutes this season, i.e. the equivalent of 10 full games.)
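The 900-minute threshold matters because short appearances can inflate a rate. A minimal sketch with made-up players, where `goals_per_90` is an assumed column name:

```r
# Made-up players; Player B has a high rate but very little playing time.
players <- data.frame(
  name    = c("Player A", "Player B", "Player C"),
  goals   = c(20, 3, 12),
  minutes = c(1800, 180, 2700),
  stringsAsFactors = FALSE
)

# Keep players with at least 900 minutes (10 full games), then rank by rate.
regulars <- subset(players, minutes >= 900)
regulars$goals_per_90 <- regulars$goals / regulars$minutes * 90
regulars <- regulars[order(-regulars$goals_per_90), ]

print(regulars)
```

Without the filter, Player B (1.5 goals per 90 minutes from only two games' worth of time) would top the list.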

## # A tibble: 6 x 5
## # Groups:   current.club [6]
##   current.club                current.league      Mean total_minutes     n
##   <fct>                       <fct>              <dbl>         <int> <int>
## 1 UC Sampdoria                Serie A            0.816          4323     3
## 2 Vitória Setúbal FC          Liga NOS           0.729          3044     4
## 3 Paris Saint-Germain         Ligue 1            0.658         13046     5
## 4 Waasland-Beveren            Jupiler Pro League 0.647          9070     8
## 5 Trabzonspor                 Süper Lig          0.615          5385     5
## 6 Tottenham Hotspur           Premier League     0.562          8137     5

Here we see that there is no direct bias from league, club or player in terms of efficiency. Since the number of teams and players changes in each grouping, we see different leaders at each level.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Our primary feature of interest was the total number of goals scored by strikers. We observed a roughly linear relationship between minutes played per season and goals scored. We also observed that this relationship becomes closer to exponential once the minutes exceed 3250 (about 36 full games). This is because players who play this many games are exceptionally good players with high consistency.

Another strong factor that affected the total number of goals scored per season was position. Only two positions are given in this data: CF and W. I observed that CF has a higher mean than W.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes. I observed that wingers (W) tend to make more assists per season than central forwards (CF).

I also observed the following relationships:

  • LaLiga strikers Lionel Messi and Cristiano Ronaldo (joined by Luis Suárez in 2015-16) are behind the outliers in total goals per season, which are higher than in other leagues.
  • In terms of goal scoring efficiency (goalsperminute) the top league of season 2017-18 is the Süper Lig
  • In terms of goal scoring efficiency (goalsperminute) the top player of season 2017-18 is Bafétimbi Gomis
  • In terms of goal scoring efficiency (goalsperminute) the top club of season 2017-18 is UC Sampdoria

What was the strongest relationship you found?

The number of goals per season is highly correlated with position and total minutes on pitch.

Multivariate Plots Section

Relationship of goals, minutes/games and position

Note how CFs occupy the upper part of the plot. It is very obvious that their goal efficiency is higher.

When we break it down by league, we see more or less the same behaviour. No league stands out visually.

When we break it down by season, again nothing stands out visually, except the 2017-18 season. This is normal, as that season was not yet finished when the data was collected.

The separation of positions is even clearer when we look at games played vs. goals scored.

Breaking it down by league again does not reveal a lot. The Russian Premier Liga and Greek Super League are a bit behind the others in terms of maximum goals scored. This is consistent with their last two places in the efficiency table above.

Relationship of assists, minutes/games and position

Now, let’s look at the assists the same way we did for goals:

The most important observation is that assists are less separated by position.

Serie A is a bit different here. I will add a more detailed explanation of this in the reflection part.

Age does not seem to be correlated in any visible sense with goals scored.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The most interesting strengthening feature was position. It was clearly observed that central forwards (CF) are more involved in goals scored than wingers (W).

Were there any interesting or surprising interactions between features?

The effect of position on assists made was a bit in favor of wingers, as expected, but the separation was not as clear as the effect of position on goals scored.

It was also interesting to see that in Serie A (Italy) central forwards are much less involved in assists than in the other leagues of Europe.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

N/A


Final Plots and Summary

Plot One

Description One

Our main feature of exploration was the number of goals scored per season. This histogram shows the most general characteristics of strikers’ goal performance in football. For European leagues (and I would strongly expect other continents to be very similar) we found that a striker scores on average 6.7 goals per season. Considering that the third quartile is 9 goals, we can consider a striker a decent one if he scores more than 9 goals per season.

Plot Two

Description Two

With the information provided in the dataset, it turned out that a strong driver of total goals per season was the total minutes a player spent on the pitch. The scatter plot above shows that, up to 3000 minutes of game time, a striker’s total goals per season are linearly proportional to the minutes played.

However, a better fit reveals another interesting finding beyond 3000 minutes (approx. 33 full games). The goal tally of players who play more than this per season increases roughly exponentially. As I stated above, this is an interesting finding for me, but nonetheless explainable: only rare and skilled players play that many games per season, so not only the minutes they get but also their exceptional skills become more distinguishable.

Plot Three

Description Three

After the hint given by the scatter matrix, it was obvious that position was very relevant to the total goals scored per season. Therefore, it was necessary to further investigate the scatter plot of goals against minutes by position. Color coding by position revealed even more interesting and clarifying results. We saw that central forwards clearly dominate the upper part of the scatter plot.


Reflection

As a football fan and enthusiastic data scientist “candidate”, I found it very entertaining and educational to go through this EDA. I came across this dataset while searching for data of my own; it was a good choice in terms of structure, but it still needed to be modified a little in order to work smoothly in RStudio. Thanks to the Python (pandas) skills we learned previously, it was easy to shape the data for exploration.

I really did not know what to expect from this data. I was not even sure how accurately it had been collected. The first disappointing thing was that the Bundesliga (German league) was not included. I guess their statistics are subject to copyright etc. The second suspicion I still have concerns the relatively high number of zero goals and assists in the oldest seasons. Depending on the analysis, I either filtered out these values or used only the most recent season.

Other than that, every step was either jaw-dropping (at least for me :) ) or a mathematical confirmation of the intuitive expectations of a 36-year-old football fan like myself. The following are my highlights:

I think that including only wingers and central forwards lacks a bit of completeness. It could be logical to include attacking midfielders too, but then I think it would be harder to get such a clean scatter plot when coloring by position. For future analysis, it may be interesting to look at assists vs. goals and define a more elaborate metric for a so-called “complete” striker, whose contribution to his team is even more than just scoring goals.